Data import

Reading in data files

Daniela Palleschi

Humboldt-Universität zu Berlin

2023-05-08

Wiederholung

Last week we…

  • learned the difference between numerical and categorical data types
  • produced our first plots with ggplot2
  • learned which plots are appropriate for different data types

Heutige Ziele

Today we will…

  • learn how to eyeball a new data set
  • learn how to import different data types
  • learn how to enter data by-hand
  • visualise a new dataset

Lust auf mehr?

Exploring data

  • last week we worked with data provided in the R package palmerpenguings
    • data from packages are a great way to learn data science tools
    • but you’ll want to work with your own data at some point
  • today we’ll learn the basics of reading data files into R

Packages

  • we will start using the pacman package
    • the function p_load() takes package names as arguments
    • it then checks if you have the package install
      • if yes, then it loads the package
      • if not, then it installs and loads the package
  • this saves us from having to install new packages every time
# install new packages IN THE CONSOLE!
install.packages("pacman")
# load packages
pacman::p_load("tidyverse", # wrangling
               "janitor", # wrangling
               "here", # relative file paths
               "patchwork" # plot layout
               )
  • we now have tidyverse loaded, and the new packages janitor and here installed and loaded

Built-in data

  • let’s first explore some built-in data: the iris dataset
    • how many variables are there?
    • how many observations?
# load tidyverse
library(tidyverse)
# read-in iris dataset
data("iris")

Image source: Analytics Vidhya (all rights reserved)

Explore a dataset

  • View(): open dataset
    • only run this in the console! Your document won’t render if you have it in your script
  • the head() function prints the first 6 rows of the data
# print dataset (first 6 rows)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

glimpse()

  • from the tibble package
  • gives a sideways preview of the dataframe
glimpse(iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

summary()

  • prints a summary of each variable
    • minimum, maximum, mean of numerical variables
    • number of observations per level of a categorical variable

Aufgabe

Aufgabe 1: table1

Example 1  

  1. Try loading the built-in dataset table1
  2. Explore the dataset using the functions we just learned
  3. Is there anything funny with the summary? How might we fix that?

Importing data

  • we usually want to work with our own data, not built-in toy data
  • we will focus only on rectangular data (i.e., tidy data)
  • there are many different file types that data can take, e.g., .xlsx, .txt, .csv, .tsv
    • try to avoid .xlsx; if you have an Excel dataset try saving it as a .csv before reading it into R
  • csv is the most common data file type: Comma Separated Values
  • this is what a simple CSV file looks like
Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
  • the first row (the “header row”) contains the columns names

  • the subsequent rows contain the data

  • how many variables are there? how many observations?

readr package

  • the same data can be viewed as a table, just like we did with penguins and iris
    • but first we have to read in the data
  • the readr package (part of tidyverse) can load in most data types
read_csv(here::here("daten", "students.csv"))
Table 1: Data from the students.csv file as a table.
Student ID Full Name favourite.food mealPlan AGE
1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 Barclay Lynn French fries Lunch only 5
3 Jayendra Lyne N/A Breakfast and lunch 7
4 Leon Rossini Anchovies Lunch only NA
5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 Güvenç Attila Ice cream Lunch only 6

Aufgabe

Aufgabe 2: table1

Example 2  

  1. Import the students.csv dataset and save it as an object called df_students
    • df_ is short for DataFrame; it’s a good idea to use a prefix before object names so we know what each object contains
  2. When importing data with read_csv, some information is printed in the Console. What is printed?
  3. Explore the dataset using the functions we just learned
  4. Do you see anything odd?

The here package

  • how did R know exactly where to find the daten folder?
  • our working directory is set to the location of our RProject within our computer
  • whenever we want to access data in our RProject, we should use here::here()
  • to see where here() is starting from, run here()
here()
[1] "/Users/danielapalleschi/Documents/IdSL/Teaching/SoSe23/BA/ba_daten"
  • this will look different for all of our machines
    • but what should be the same is our folder structure within our projects (e.g., daten/students.csv)

here package

Image source: Allison Horst (all rights reserved)

here Paket

Before the here package, we used to have to explicitly tell R where on our computer a file was located (e.g., /Users/danielapalleschi/Documents/IdSL/Teaching/SoSe23/BA/ba_daten/daten/students.csv), or use the setwd() (set Working Directory) function to tell R where to assume where all files are located (e.g., setwd(/Users/danielapalleschi/Documents/IdSL/Teaching/SoSe23/BA/ba_daten)). Luckily, you never need to use these absolute file paths or setwd()!

From the here package documentation:

The goal of the here package is to enable easy file referencing in project-oriented workflows. In contrast to using setwd(), which is fragile and dependent on the way you organize your files, here uses the top-level directory of a project to easily build paths to files.

This means we now have the huge benefit of being able to move our project folder anywhere, and our file path will still be relative to wherever we’ve moved our project fold. This mean the project runs independent of where on your computer it is located. You can also send somebody the project folder, and everything should run on their machine!

Missing values

  • data transformation refers to ‘fixing’ our data when it’s not ‘tidy’
  • in our df_students dataframe, you might’ve noticed some NA or N/A values
    • N/A was written as text, and so R reads it as such
    • NAs in R refer to missing data (“Not Available”)
  • last week we saw some warning messages when we created our scatterplots
    • these warnings were telling us about missing values (NAs) that were not plotted
  • true missing values are completely empty, so having N/A written in our df_students data is not actually read as a missing value
  • to fix this, we can use the argument na = for the read_csv() function
    • this argument tells read_csv() which values it should equate with missing values

Missing values

df_students <- read_csv(here::here("daten", "students.csv"),
                        na = "N/A")
head(df_students)
# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE   
         <dbl> <chr>            <chr>              <chr>               <chr> 
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          "4"   
2            2 Barclay Lynn     French fries       Lunch only          "5"   
3            3 Jayendra Lyne    <NA>               Breakfast and lunch "7"   
4            4 Leon Rossini     Anchovies          Lunch only          ""    
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch "five"
6            6 Güvenç Attila    Ice cream          Lunch only          "6"   
  • now the value that was formerly “N/A” is read as an NA
    • but what about the empty cell?
  • we have now overwritten read_csv() reading empty cells as NA
    • how can we tell read_csv() to read more than one type of input as NA?
    • i.e., we want to tell it to read "" and "N/A" as NA

Missing values

df_students <- read_csv(here::here("daten", "students.csv"),
                        na = c("N/A",""))
head(df_students)
# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
2            2 Barclay Lynn     French fries       Lunch only          5    
3            3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
6            6 Güvenç Attila    Ice cream          Lunch only          6    

Column names

  • if we print df_students in the Console we’ll see that the first two columns names are surrounded by backticks (e.g., `Student ID`)
    • this is because they contain an empty space, which is not syntactically valid (variable names need to start with a letter and not contain spaces or special characters)
  • a quick fix is to the function clean_names() from the janitor package
janitor::clean_names(df_students)
# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan           age  
       <dbl> <chr>            <chr>              <chr>               <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
2          2 Barclay Lynn     French fries       Lunch only          5    
3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
6          6 Güvenç Attila    Ice cream          Lunch only          6    
  • that looks better!
    • but if you now run head(df_students), do you see the cleaned column names?

Column names

  • when we pass an object through a function, the object is not ‘updated’
    • so we have to again assign the object
df_students <- janitor::clean_names(df_students)
  • but we often know that we want to run multiple functions (read_csv(), clean_names()) on the same object
    • we can do that using pipes

Pipes

  • pipes are placed at the end of function call when the result of this function should be passed through a subsequent function
    • they can be read as “and then…”
read_csv(here::here("daten", "students.csv")) %>%
  head()
# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
2            2 Barclay Lynn     French fries       Lunch only          5    
3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    
4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
6            6 Güvenç Attila    Ice cream          Lunch only          6    
  • there are currently 2 pipes that can be used in R
    1. the magrittr package pipe: %>%
    2. the new native R pipe: |>
  • so far I haven’t discovered any major difference between the two, but for now I’m sticking with %>%

Aufgabe

Aufgabe 3: pipes

Example 3  

  1. Load the students.csv dataset again with fixed NAs and then
    • Use a pipe to call clean_names() on the dataset, and then
    • call the head() function
  2. Load the students.csv dataset again with fixed NAs, saving it as the object df_students, and then
    • use a pipe to call clean_names() on the data set
  3. Why shouldn’t you use a pipe and the head() function when you’re saving the dataset as an object?

Other file types

  • once you’re comfortable with read_csv(), readr’s other functions are easy to use
    • you just have to know when to use which ones
  • read_csv2() reads semicolon-separated files
    • These use ; instead of , to separate fields and are common in countries that use , as the decimal marker
  • read_tsv() reads tab-delimited files
  • read_delim() reads in files with any delimiter
    • will try to guess the delimiter unless you specify it with the argument delim = (e.g., read_delim(students.csv, delim = ","))

Others I haven’t yet needed:

  • read_fwf() reads fixed-width files
  • read_table() reads a common variation of fixed-width files where columns are separated by white space
  • read_log() reads Apache-style log files

Aufgabe

Aufgabe 4: filetypes

Example 4  

  1. What function would you use to read a file where fields were separated with “|”?
  2. What arguments do read_csv() and read_tsv() have in common?
  3. How would you load in a dataset with a semicolon as delimiter?
  4. Load in the dataset nettle_1999_climate.csv
    • how many variables are there?
  5. Load in the dataset nettle_1999_climate2.csv
    • how many variables are there? Is this right?
  6. Load in the dataset nettle_1999_climate3.csv
    • how many variables are there? Is this right?

Data entry

  • when collecting small amounts of data, you may want to enter them by hand into R
    • there are two useful functions that help us take collected data and create a tibble
    • tibbles are modern dataframes, don’t worry about the definition of a tibble just yet
  • let’s collect everybody’s initials, heights (cm), and birthday (ddmm)
i <- "DP"
h <- 171
b <- 0705

tibble()

tibble(
  initial = i,
  height = h,
  bday = b)
# A tibble: 1 × 3
  initial height  bday
  <chr>    <dbl> <dbl>
1 DP         171   705

tribble()

  • it might be easier to enter data row-by-row
    • this is possible with a transposed tibble (tribble)
tribble(
  ~initial, ~height, ~month, ~day,
  "DP", 171, 07, 05
)
# A tibble: 1 × 4
  initial height month   day
  <chr>    <dbl> <dbl> <dbl>
1 DP         171     7     5

Aufgabe

Aufgabe 6: tibbles

Example 5  

  1. Save the tibble (i.e., data frame) as the object df_wir
  2. Explore the dataset (e.g., print summaries)
  3. What seems strange to you?
   initial              height        month        day   
 Length:1           Min.   :171   Min.   :5   Min.   :7  
 Class :character   1st Qu.:171   1st Qu.:5   1st Qu.:7  
 Mode  :character   Median :171   Median :5   Median :7  
                    Mean   :171   Mean   :5   Mean   :7  
                    3rd Qu.:171   3rd Qu.:5   3rd Qu.:7  
                    Max.   :171   Max.   :5   Max.   :7  

Working with variables

  • in a dataframe, variables are organised in columns

Variable types

  • readr guesses the type of data each column contains
    • the major columns types to know are numerical and factor
  • factors contain categories or groups of data, but can sometimes look like numerical data
    • for example, our column month contains numbers, but it could also contain the name of each month
    • a good way to know which is which: it makes sense to calculate the mean of a numerical variable, but not of a factor
      • it makes sense to calculate our mean height, but not our mean birth month
df_wir$month <- as_factor(df_wir$month)

Indexing

  • sometimes we want to access a certain variable (column) in a data frame
    • using base R, we do that with $: dataframe$variable
df_wir$height
[1] 171
  • and we can use this as the argument of a function
    • try finding the minimum and maximum heights in our group
    • calculate the sum of our heights

Aufgabe

Aufgabe 6: tibbles

Example 6  

  1. Convert df_wir$month to a factor
  2. Compute our mean height using the function mean() and indexing
  3. Compute the sum of our heights.

Write to file

  • we can also save our dataframe so that we can come back to it later with write_csv(object, "filename")
  • be very careful! If you use an existing filename you can overwrite a dataset you already have
write_csv(df_wir, file = here("daten", "wir.csv"))

Exercises

Here are some more in-depth exercises.

Import/export

  1. Load in the built-in starwars dataset, that contains information about Star Wars characters.
  2. Export the data as a csv file called starwars in your daten folder.
  3. Import the starwars.csv data file using read_csv()

Convert column types

  1. Convert the following variables to factors:
    • hair_color
    • skin_color
    • eye_color
    • sex
    • gender
    • homeworld
    • species

Plots

  1. Produce the following three plots, and briefly describe what they show and any conclusions that can be drawn from them.

  2. Create another plot of your choosing from the starwars dataset. Add it to the plot grid (you’ll have to adjust the syntax). Describe what it shows.

(fig_sw_height + fig_sw_gender) /
  fig_sw_height_mass +
  plot_layout(nrow = 2,  heights = c(.4,.6))

Heutige Ziele 🏁

Heute haben wir…

  • learned how to eyeball a new data set ✅
  • learned how to import different data types ✅
  • learned how to enter data by-hand ✅
  • visualised a new dataset ✅

Session Info

Hergestellt mit R version 4.2.3 (2023-03-15) (Shortstop Beagle) und RStudioversion 2023.3.0.386 (Cherry Blossom).

sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] patchwork_1.1.2 here_1.0.1      janitor_2.2.0   lubridate_1.9.2
 [5] forcats_1.0.0   stringr_1.5.0   dplyr_1.1.1     purrr_1.0.1    
 [9] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.2  
[13] tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-3 pillar_1.9.0       compiler_4.2.3     tools_4.2.3       
 [5] bit_4.0.5          digest_0.6.31      timechange_0.2.0   jsonlite_1.8.4    
 [9] evaluate_0.20      lifecycle_1.0.3    gtable_0.3.3       pkgconfig_2.0.3   
[13] rlang_1.1.0        cli_3.6.1          rstudioapi_0.14    parallel_4.2.3    
[17] yaml_2.3.7         xfun_0.38          fastmap_1.1.1      withr_2.5.0       
[21] knitr_1.42         generics_0.1.3     vctrs_0.6.1        hms_1.1.3         
[25] bit64_4.0.5        rprojroot_2.0.3    grid_4.2.3         tidyselect_1.2.0  
[29] snakecase_0.11.0   glue_1.6.2         R6_2.5.1           fansi_1.0.4       
[33] vroom_1.6.1        rmarkdown_2.21     pacman_0.5.1       farver_2.1.1      
[37] tzdb_0.3.0         magrittr_2.0.3     scales_1.2.1       htmltools_0.5.5   
[41] colorspace_2.1-0   labeling_0.4.2     utf8_1.2.3         stringi_1.7.12    
[45] munsell_0.5.0      crayon_1.5.2      

Literaturverzeichnis

Nordmann, E., & DeBruine, L. (2022). Applied data skills (Version 2.0). Zenodo. https://doi.org/10.5281/zenodo.6365078
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (n.d.). R for Data Science (2nd ed.). https://r4ds.hadley.nz/